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Abstract 

In parameter determination for the heteroskedastic probit model, both in simulated 
data and in actual data collected by Alvarez and Brehm] (1995), we observe a failure of 
traditional local search methods to converge consistently to a single parameter vector, 
in contrast to the typical situation for the regular probit model. We identify features 
of the heteroskedastic probit log likelihood function that we argue tend to lead to this 
failure, and suggest ways to amend the local search methods to remedy the problem. 
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A common difficulty encountered in the estimation of statistical models is that of unequal 
variances or heteroskedasticity. In the context of ordinary least squares, heteroskedasticity 
does not bias parameter estimates; rather, it either inflates or underestimates the standard 
errors. Heteroskedasticity, however, is more problematic in discrete choice models such as 
logit or probit and their ordered and multinomial variants. If we have nonconstant variances 
in the error term of a discrete choice model, not only are the standard errors incorrect, but 


the parameters are also biased and inconsistent (Yatchew and Griliches 1985). 


Alvarez and Brehm (|1995) generalize a basic model for heteroskedasticity developed by 


Harvey (1976). They call these heterogenous choice models, which include heteroskedastic 


probit and heteroskedastic ordered probit models to correct for unequal variances with dis¬ 
crete outcomes. They also used heteroskedasticity as a means of exploring heterogeneity in 
choice situations. These heteroskedastic probit models have been frequently used to explore 


heterogenous choices and behaviors (Alvarez and Brehm 1997, 1998, 2002 Busch and Rein¬ 


hardt 1999 Gabel 1998; Lee 2002; Krutz 2005). Routines for these models have become 


standard in statistical software such as Stata, Limdep, SAS, Eviews and Shazaam. 

We use simulations to explore the optimization difficulties that can arise during the 
estimation of these models. In our simulations, the models are correct. Thus, the anomalies 
that we find cannot be explained by specification error. We argue that estimation difficulties 
are due to the functional form of these models. In this paper, we observe the following 
dichotomy: the usual algorithms for optimization of regular probit log likelihood often fail 
with the heteroskedastic probit model. It is likely that the problems with these model are 
under-reported in the literature, since the errors that are encountered may give rise to the 
“hle-drawer” problem where difficulties with these models go unreported as these papers are 
sitting in investigators’ file drawers ( [Iyengar and Greenhouse||1988 ). The point of this paper 
is, first, to give evidence of that dichotomy, second, to attempt to identify those features of 
the heteroskedastic likelihood function that lead to the failure, and third to suggest ways a 
local search might be adapted to those features. 

To this end, in section [T] of this paper, we give a brief exposition of the regular and het- 
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eroskedastic probit models. In section we investigate the performance of “out-of-the-box” 
optimization techniques with the heteroskedastic probit model, hrst through simulated data, 
and second through a data set collected by Alvarez and Brehm (1995). In section]^ we delve 
more deeply into our simulated data to attempt to gain some intuition about the graph of the 
likelihood function, and thus about the failure of local search methods, in the heteroskedastic 
case in general. And in section we suggest a modihcation to tailor optimization methods 
to the heteroskedastic probit model, and offer additional recommendations. 


1 A Probit Model with Heteroskedasticity 

We first define the model, beginning with the standard probit model. Fix a constant 
fci—dimensional vector f3, and assume that, given an observed value Xj of a fci—dimensional 
vector of random variables and an observed value e* of a random variable distributed normally 
with mean 0 and variance a latent random variable y* takes the valu^ 


y* = x'/3 + e,. 

This value is not seen by the researcher. A researcher observes only the value 

[ 1 if 2/* > 0 
Vi = \ 

\ 0 if y* < 0. 

One then readily calculates that 

Pr(|/i = 1 I Xi) = <F , 

^To be completely precise, we note that we think of vectors as column vectors. 
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where $ is the cumulative distribution function for the standard normal distribution. Only 

the parameter —f3 is identihed, so typically one assumes 
a 


a = 1 


for purposes of identification. 

In the heteroskedastic probit model, one allows a to take values other than 1, by esti¬ 
mating a model for its value. This may arise in many situations, for example, in a model of 
personal choice where levels of information vary across individuals. A convenient form for 
modeling a is 

a = exp(z' 7 ), 

where z is a A; 2 “dimensional random vector and 7 is a A; 2 “dimensional parameter vector. 
Thus the model can be written as 


Pr(l/i = 1) 


$ 


x'/3 


exp(z' 7 ) 


A natural way to think of this model, as above, is as consisting of a latent variable given by 


y* = 1^(3 + e. 


where 

e I z ~ A/” (0, exp( 2 z' 7 )). (1) 

But one can also think of it as a model with a latent variable with mean -^-- and 

exp(z' 7 ) 

a disturbance term distributed as a standard normal variable. We also observe that there 
can be conditions in which the model is not identified, for example if z consists only of a 
constant. 

One can estimate this model using maximum likelihood, much as one does for the probit 
model. Assume that there are n observations of ?/*, x* and Zj. For convenience, let y be the 
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n X 1 vector of all observations of yi, let X be the n x ki matrix whose row is x' and let 
Z be the n x k 2 matrix with row given by z'. Then the log likelihood function is given 
by: 


' (/3,7 I y, ^, ^) = X] ( Vi In $ 


2 = 1 


x'/3 


exp(z'7) 


+ (1 -|/i)ln 1 - $ 


x'/3 


exp(z'7) 


( 2 ) 


Given a set of n observed values of the random vectors y, x and z, in order to obtain the 
maximum likelihood estimate, one maximizes this function over the space of possible choices 
of (/3,7) G 


2 Search Algorithm Performance 

We now examine how standard optimization algorithms perform when maximizing the het- 
eroskedastic probit log likelihood function. 

2.1 Simulated Data: Heteroskedastic Probit Model 

We hrst consider the behavior of search techniques when employed on simulated data sets. 
We note that this and other data sets used in the paper are not designed to be pathological. 
In this section, we describe the simulated data generating process and then use four different 
search methods on this one data set. 

The basic data generating process is as follows. First, we generate 1000 observations of 
Xj and Zj, with components equal to 0 or 1. Then we generate a particular choice for both 
/3 and 7 that we call the model parameters, which we denote by /3 q and 7 q. Given these 
choices, for each observation i, we generate a disturbance term e* sampled from the normal 
distribution with mean 0 and variance exp(z' 7 Q). Then we consider each sum 

x'/3o + Ci. (3) 

If this sum is greater than or equal to 0, we dehne yi to be 1, and we dehne y^ to be 0 if the 
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sum is less than 0. In this fashion, we generate the parameters /3 q and 7 q, and the data set 

1 < z < 1000}. 

Now consider one potential complication. Suppose that for all z, the term z' 7 q is large 
and negative, so that the variance exp(z' 7 g) is quite small. In this case, it could be that 
the term x'/3g in the sum (|^ above is much larger in magnitude than that of the term e* 
for all observations, and thus may completely determine the value of z/j. We refer to this as 
a situation in which there is no crossover. Now dehne l(z;) = 1 for n > 0 and 0 for n < 0. 
Then, more formally, we say that there is crossover for observation z if 


that is, if the disturbance term leads to yi having a different value from that which the linear 
term x'/3q alone would predict. If all of the observations exhibit no crossover, then it seems 
intuitively likely that there will be no optimal choice for 7 ; the basic idea is the following]^ 
Suppose, to the contrary, that there is an optimizing argument and suppose for example 
that it occurs when /3 = /Sg. For those observations with either x'/3q > 0 or x'/3g < 0, 
because the components of z are nonnegative, one can observe from Equation (|^ that the 
summands of the log likelihood function will generally increase as the components of 7 grow 
large and negative]^ For this reason, and more importantly to focus on the class of models 
for which the variance portion of the model plays an important role, we ensure that in the 
data generating process there is a signihcant percentage of crossover. 

We now turn to the details of simulating the data. We take the dimension of x to 
be three and of z to be two. The hrst component of x is a constant equal to one, while 
the second and third components, as well as the two components of z, are realizations of 
Bernoulli variables with probability 1/2, i.e. discrete uniform {0,1} random variables. We 


^This is of course exactly similar to the complete separation problem for the probit model. For one 
discussion of the complete separation problem, see Albert and Anderson (1984). Separately, we note that 
our purpose here is not to provide a complete proof, but only to explain the reasoning behind the choices in 
our data generating process. 

^We assume here for simplicity that many of the components of the terms are strictly positive. 
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generate 1000 such observations Xj and z*. As mentioned above, we refer to the parameters 
used in the data generating process as the model parameters and denote them by /3o and 79 . 
The components of these parameters are themselves sampled from a (continuous) uniform 
distribution on [—5,5], although there are important provisos to this statement. For one, 
we select the second and third component of /3o from [—5,5], but after doing so we in fact 
choose the hrst component so that the mean of x'/3q over all observations is 0. The purpose 
of this is to avoid the situation in which x'/3g is, say, positive for all or most of the data set; 
unless the variance is fairly large, we might expect to see very little crossover. The second 
caveat to the statement that the parameters are sampled from a uniform distribution is that 
we discard any choices (/3o)7o) of parameters for which the crossover is less than 20% 
or greater than 30%; i.e., we simply repeat the random sampling until we hnd such a choice 
of parameters]^ Having selected (3q and 79 , then for every observation indexed by i, we 
sample a disturbance term e* from a (pseudo-)normal distribution with mean 0 and variance 
exp(z' 7 g). Then we define 


Vi = 1 (x'/3o + e*). 

This gives the complete set of simulated data {xj,Zj,?/j ; 1 < i < 1000}. Now we fix one 
data set simulated in this fashion, and consider the performance of different algorithms when 
maximizing the log likelihood on this data setj^ 

We hrst present results for the algorithm commonly referred to as the BFGS algorithm 

In each of 1000 

^The percentages chosen here are somewhat arbitrary, but reflect a desire for the average summand in 
the log likelihood function to be somewhat better (less negative) but not too much better than what turns 
out to be an essentially minimal value of log(l/2). If there is too much crossover, this average summand 
(which we later refer to as the normalized value of the log likelihood function) will generally be very close 
to log(l/2), whereas if there is very little crossover, it will generally be much better. In the former case, 
the value of the log likelihood at the model parameters is not much different from its value at what we call 
below the plateau solution, so it is hard to demonstrate the difference between the two. In the latter case, 
the model approaches a condition much like the case of complete separation in the probit model, so we want 
to avoid this situation. Recall as well that our purpose in this paper is merely to demonstrate that under 
reasonable conditions, there are concerns with optimization when using the heteroskedastic probit model, 
as opposed to showing that there are concerns under any conditions. Also, we note that we discuss why the 
value log(l/2) arises more below. 

®The model parameters for this data set, to two decimal places, are given by /3g = (.56, —.31, —.84) and 
7o = (3.43,-4.07). 


Broyden 

1970 

Fletcher 

1970; 


Fletcher and Reeves 1964 Shanno 1970 
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runs, we randomly select a starting value for the parameter (/3o)7o) ^e estimated. Each 
choice of initial values has components in the interval [—5,5], so that each choice of initial 
values is in the box [—5, 5]^. Denote the estimate resulting from the run by 


lBFGS,i ) i 


and the set of all such estimates by 

EbFGS = I {pBFGS,i-:lBFGS,^ i 1 < * < lOOoj . 

We are interested in how close the estimates are to the model parameters. We employ 
two measures to measure the proximity of the estimates in Ebfgs to the model parameter 
values. The hrst measure is the simple Euclidean distance between each estimated parameter 
value and the model parameter value. The second measure starts by looking at the difference 
between the value of the log likelihood function F at the model parameter vector and the 
value of the log likelihood function F at each estimated parameter vector. We divide this 
difference by the number of observations in the data set. The idea here is to normalize the 
difference in order to better compare results using this data set with later results using a 
real data set, which has a different sample size. Here the number of observations is 1000, 
which we denote by hsim for expositional purposes. 

We present histograms of these two measures over all 1000 runs. We look at two his¬ 
tograms, for frequencies observed in each of the following two sets, which correspond to the 
two measures described above: 


D 


BFGS — 


{PBFGS,ii^BFGS,i) (/^O) 7o) 1 < * < lOOoj , 


where || II 2 denotes Euclidean distance, and 


Vbfgs — 


nsiM L 




, I <i< 1000 



We note that -F((/3 q,7q)) = —0.45946. There is in fact an added complication for the 

nsiM 

histogram for Vbfgs, which we discuss momentarily. 

The two histograms are displayed in Figure The left histogram plots frequencies for 
the set DbfgSi ^md the right histogram plots frequencies for the set Vbfgs- The left plot 
uses a log scale on the horizontal axis, to more clearly present key phenomena. The right plot 
also uses a log scale on the horizontal axis, but only in the right section of the graph. There 
are actually almost 300 choices of starting values for which the log likelihood values at the 
estimates found using the BFGS algorithm are actually greater (i.e., less negative) than the 
value at the model parameters. This is not that surprising- we are using a hnite sample, and 
obviously not a population. For the log likelihood function, which is the empirical analogue 
of the expectation of a generic summand over the population, we do not have any reason to 
believe that the log likelihood should be maximized at the model parameters; on the other 
hand, under suitable conditions we might expect that the expectation would be maximized 
at the model parameters]^ 

Naturally, these negative differences would not show up on the graph if we were to use a 
log scale, given that the log scale would only allow representation of those observations for 
which the log likelihood has a value at the estimated parameters that is in fact strictly less 
than the value of the log likelihood at the model parameters. We represent the observations 
for which the value of the log likelihood at the estimated parameters is actually greater than 
the value of the log likelihood at the model parameters in the graph in the following way. 
We plot a rectangle on the left side of the plot with height equal to the number of these 
cases and labeled as “Values Better than at Model Parameters.”]^ As a hnal note concerning 
the two plots in Figure ]T| observe that the scales for the horizontal and vertical axes differ 
by plot. 

When starting with random initial values, the parameter estimates do not generally 
appear to be close to the model parameters, for about 70% of the 1000 runs. In fact, the 


®See, for example, Lemma 14.1 of Ruud (2000), for a statement of a result of this type concerning the 
expectation. 


^Its location along the horizontal axis is of course not meaningful. 
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Simulated Data: Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using BEGS (Log Scale) 
(Using 1000 Choices of Initial Values) 


Simulated Data; Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using BFGS (Log Scale) (Using 1000 Choices of Initial Values) 



0.1 1 10 100 1000 10000 


Euclidean Distance (Log Scale) 



Values Better than 
at Model Parameters 


0.01 


0.1 


Normalized Value at Mode! Parameters Minus 
Normalized Value F at Estimate, l.e.: 
-0.45946 - F; (Uses Log Scale) 


Figure 1: Performance of the BFGS Search Algorithm on a Simulated Data Set for the 
Heteroskedastic Probit Model, for 1000 Random Choices of Initial Values 

Euclidean distance between the parameter estimates and the model parameters is at least 
3.16 (= 10'^) for all but 15 of the 1000 runs and at least 10 for 790 of the runs. Moreover, 
for more than 600 runs, the value of the log likelihood function at the estimated parameter 
values is at least 223.87 less than the value at the model parameters, which is —459.46]^ 
Thus by two measures the BFGS algorithm appears to only perform well for some fraction 

®The value .22387 equals which is the left endpoint of the rightmost histogram cell in the plot 

on the right in Figure [l] recall that this value was obtained by normalizing, i.e. by dividing by 1000. 


10 































of starting values chosen randomly from a box, at least for this data set. 


Simulated Data; Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using CG (Log Scale) 
(Using 1000 Choices of Initial Values) 



0.1 1 10 100 1000 10000 

Euclidean Distance (Log Scale) 


Simulated Data: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using CG (Log Scale) (Using 1000 Choices of Initial Values) 



at Model Parameters 


0.00001 0.0001 0.001 0.01 0.1 1 10 
Normalized Value at Model Parameters Minus 
Normalized Value F at Estimate, i.e.: 

-0.45946 - F; (Uses Log Scale) 


Simulated Data; Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using Nelder-Mead (Log Scale) 
(Using 1000 Choices of Initial Values) 



0.1 1 10 100 1000 10000 

Euclidean Distance (Log Scale) 


Simulated Data: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using Nelder-Mead (Log Scale) (Using 1000 Choices of initial Values) 



at Modal Parameters 

0.00001 0.0001 0.001 0.01 0.1 1 10 
Normalized Value at Model Parameters Minus 
Normalized Value F at Estimate, i.e.: 

-0.45946 - F; (Uses Log Scale) 


Figure 2: Performance of the CG and Nelder-Mead Search Algorithms on a Simulated Data 
Set for the Heteroskedastic Probit Model, for 1000 Random Choices of Initial Values 

Figure presents similar plots for the same simulated data set, but with two other 
optimization techniques. The simulations were run so that the same 1000 choices of initial 
parameter vector were made for each search technique. (To be clear, these are also the same 
initial choices used for the BFGS search technique.) We combine results for two separate 
algorithms in one hgure. The top two plots present measures of the quality of estimates 


11 
















































found from implementing a conjugate gradients method (referred to as CG) developed by 


Fletcher and Reeves (1964), while the bottom two plots display results found using a method 


developed by Nelder and Mead (1965). For these hgures, we adjust the plots so that the 
horizontal and vertical scales are the same within each column (but not within each row). 

The hrst row considers the CG method. Looking only at the right histogram, we see that 
at least half of the initial values result in parameter estimates that give log likelihood values 
more than 100 (= .1 x 1000) less than the log likelihood value at the model parameters. The 
second row reports on the performance of the Nelder-Mead algorithm. It actually performs 
fairly well, but the right histogram shows that there are roughly 300 starting values at which 
the log likelihood is more than 31.6 (= 10’ -1.5 X 

lOOO) less than the log likelihood value at 

the model parameters. 

We also implemented a simulated annealing algorithm, specihcally a variant given by 


Belisle (1992), which we refer to in the paper as SANN. Due to its less common usage in 


political science, we present those results in the appendix. (See Figure 12 ) Briefly, the 
SANN algorithm performs very well, with all but 41 of the 1000 initial values leading to 
estimates for which the log likelihood values are greater than the log likelihood values at the 
model parameters. 

In addition, we looked at a few alternative situations.For one, we considered the case of 
continuous values of the data for the z variables, rather than the discrete values used in the 
simulated data set discussed earlier. (We still take the components of z to be nonnegative.) 
Running search algorithms on a simulated data set with continuous z variables led to prob¬ 
lems like those found above, namely estimated parameters far from the model parameters, 
and values of the log likelihood function at the estimated parameters far from its values at 
the model parameters]^ 

We also looked at some simulations in which the model parameters 7 were a bit smaller 
in absolute value. Recall that the model parameter 7 q in the data set above is (3.43, —4.07). 


®We note that our intuitive explanation given in Section [3. 1 1 indicates that these problems would arise for 
both discrete and continuous z. In fact, that reasoning might lead us to believe the problems could be even 
worse for continuous z, given that we would expect far fewer occurrences of z = 0 in the continuous case. 
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The reason for doing so is that one could be concerned that this value would lead to a large 
variance (exp(3.43)) in the case that z = (1,0) and a small variance (exp(—4.07)) in the case 
that z = (0,1); on the other hand, in the case that z = (1,1), the variance (exp(3.43 — 4.07)) 
would not be as large or small. There were similar problems with the search algorithms for a 
data set with 7 q = (—0.6, 0.84, —0.69, —0.15, —0.16,0.42), although the problems did occur 
with less frequency (as a fraction of the number of choices of initial values)]^ We also note 
that below we hud similar problems with a real data set used by Alvarez and Brehm, which 
has a parameter 7 whose components are each smaller than 1 in absolute value p] 

2.2 Simulated Data: Probit Model 

We now contrast the results for the above search algorithms for the heteroskedastic probit 
model with their performance for a probit model with the same number of parameters. 

Thus, we consider a probit model with dim(/3) = 5, so that the dimension of the param¬ 
eter vector is the same in both cases. We again generate a pseudorandom data set with 1000 
observations x*, with components equal to 0 or 1 (and with the first component always equal 
to 1). Then we generate a choice of (3q by selecting the second through fifth components 
of /3o from [—5,5], but we choose the hrst component so that the mean of x'/Sg over all 
observations is 0. In the probit setting, we do not make any requirements about the fraction 
of crossover. We again generate 1000 choices for initial parameters in the box [—5, 5]® and 
start each search technique at all 1000 of these choices. 

Results for the BFGS search algorithm are presented in Figure]^ All 1000 initial values 
lead to estimates at which the log likelihood is greater than at the model parameters. More¬ 
over, all of the estimates are in fact within .233 of the model parameters. These graphs can 
be compared directly with Figure 

For the CG and Nelder-Mead search algorithms, the results are similar. We present the 
results in Figure]^ which can be compared directly with FigureFor the GG algorithm, as 

note that, in very limited testing, we did not find any problems with local search for simulated data 
sets with dim( 7 g) = 2. 

^^These are not exactly a model parameters in this case, but is analogous- see below. 
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Simulated Data: Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using BFGS (Log Scale) 
(Using 1000 Choices of Initial Values) 
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Simulated Data: Histogram of Differences of Normalized Values 
of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using BFGS (Log Scale) 
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Figure 3: Performance of the BFGS Search Algorithm on a Simulated Data Set for the 
Probit Model, for 1000 Random Choices of Initial Values 


with the BFGS method, all 1000 initial values lead to estimates at which the log likelihood 
is greater than at the model parameters. The Nelder-Mead algorithm leads to some results 
that are not as good; however, all but 19 initial values lead to estimates at which the 
log likelihood is better than at the model parameters]^ The performance of these search 
algorithms when applied to the probit model is far better than the analogous performance 
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The results for the SANN algorithm are presented in Figure 13 in the Appendix. 
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Simulated Data; Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using CG (Log Scale) 
(Using 1000 Choices of Initial Values) 
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Simulated Data; Histogram of Differences of Normalized Values 
of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using CG (Log Scale) (Using 1000 Choices of Initial Values) 
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at Model Parameters 
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Simulated Data; Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using Nelder-Mead (Log Scale) 
(Using 1000 Choices of Initial Values) 
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Simulated Data: Histogram of Differences of Normalized Values of Log 
Likelihood Function at Various Estimated Parameter Values and Normalized Values 
at Model Parameters,Using Nelder-Mead (Log Scale) 

(Using 1000 Choices of Initial Values) 



at Model Parameters 
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Normalized Value at Model Parameters Minus 
Normaiized Vaiue F at Estimate, i.e.: 

-0.17127 - F; (Uses Log Scaie) 


Figure 4: Performance of the CG and Nelder-Mead Search Algorithms on a Simulated Data 
Set for the Probit Model, for 1000 Random Choices of Initial Values 

for the heteroskedastic probit model. 


2.3 Alvarez-Brehm Data: Heteroskedastic Probit Model 


We now consider the behavior of search techniques when employed on a data set used by 


Alvarez and Brehm (1995) in their study of attitudes toward abortion. We use the same 


specihcation as Alvarez and Brehm, namely a heteroskedastic probit model using a constant 
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and seven variables in what they call the “choice model” (informally, the part depending on 
/3, so that the dimension of (3 is eight), and six variables in the “variance model” (loosely, 
the part depending on 7 , so that the dimension of 7 is six). Alvarez and Brehm actnally nse 
these explanatory variables in seven models, one for each of seven ontcome variables. Here 
we focns on one of these ontcomes: whether or not participants think that abortion should 


be legal if a woman wants it for any reason ^ In the data, we remove any observations for 
which one of the explanatory or outcome variables is missing; in so doing, we match Alvarez 
and Brehm in using 1295 observations. 

As above, we are interested in how close the estimates are to optimal in some sense. 
For operational purposes, we take the optimal value to be the estimates found by Alvarez 
and Brehm, which we refer to as the A-B parameter values. Denote this set of parameter 
values by We now present histograms of the two measures used above to 

assess the performance of the search methods. Here, for the normalized measure, we will 
divide by uab = 1295, the number of observations in the Alvarez-Brehm data. We note that 
-F((/ 3 ^ 5 , 7 ^^)) = —0.59383. We begin with the BFGS algorithm. The two histograms 


nAB 


are displayed in Figure]^ Both plots use a log scale on the horizontal axis, to more clearly 
present key phenomena; as above, this log scale only applies to the right portion of the right 
plotp^ 

Recall that, viewing the Alvarez-Brehm parameters as correct in some sense, we expect 
the log likelihood to be in general greater (i.e. negative and having less magnitude) at these 
parameters than at our estimates. However, for roughly 300 choices of starting values, the 
values of the log likelihood at the estimates found using the BFGS algorithm are actually 
greater than the value at the Alvarez-Brehm parameters. This is most likely due to the 
fact that the parameters we use as the Alvarez-Brehm parameters are from their paper, and 


^^See Alvarez and Brehm (1995) for more details. See Table 1 of their paper for results for the seven 
models. 

^^These estimates can be found in the final column of Table 1 of their paper: they are 
= (-.07, -.15, -.13, .05, -.22, -.79, .12, .51) and 7^3 = (-.22, -.48, .22, -.30, . 68 , .63). 

^^Observe that, as with the graphs for the simulated data, the scales for the horizontal and vertical axes 
differ by plot. 
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Alvarez-Brehm Data: Histogram of Euclidean Distances Between 
Estimated and A-B Parameter Values, Using BFGS (Log Scale) 
(Using 1000 Choices of Initial Values) 


Alvarez-Brehm Data: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at A-B Parameters 
Using BFGS (Log Scale) (Using 1000 Choices of Initial Values) 



0.01 0.1 1 10 100 1000 10000 


Euclidean Distance (Log Scale) 



at A-B Parameters 

0.0001 0.001 0.01 0.1 

Normalized Value at A-B Parameters Minus 
Normalized Value F at Estimate, i.e.: 
-0.59383 - F; (Uses Log Scale) 


Figure 5: Performance of the BFGS Search Algorithm on the Alvarez-Brehm Data (Using 
the Heteroskedastic Probit Model), for 1000 Random Choices of Initial Values 


reported only to two digits]^ Thus we are probably essentially getting the same parameter 
estimates as they are, as casual inspection of a few of the better estimates indicates. As 
above, we represent this in the graph by plotting a rectangle on the left side of the plot 
with height equal to the number of these cases and labeled as “Values Better than at A-B 

^®Note that this is unlike the situation for the simulated data discussed above, as here we would expect 
that Alvarez and Brehm found the estimates that maximize the log likelihood function, rather than finding 
the estimates that maximize the analogous expectation, which would typically not be maximized at the 
model parameters used to generate a simulated data set. 
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When starting with random initial values, the parameter estimates only appear to be 
close to the A-B parameters for about 30% of the 1000 runs. In fact, the Euclidean distance 
between the parameter estimates and the A-B parameters is at least 10 for 683 of the 
1000 runs. Moreover, the normalized value of the log likelihood function at the estimated 
parameter values is at least .063 less than the normalized value of the log likelihood function 
at the A-B parameters for more than 600 of the runs; thus the (unnormalized) value of 
the log likelihood function is at least 81.5 less than the (unnormalized) value at the A-B 


parameters for these runs.^ Thus by these two measures the BEGS algorithm appears to 
perform well for only a fraction of the randomly chosen starting values. 

Figure presents similar plots, again for the Alvarez-Brehm data set and starting at 
each of the same 1000 choices of initial values, but now we look at the CG and Nelder-Mead 
algorithms. As above, for these hgures, we adjust the plots so that the horizontal and vertical 
scales are the same within each column (but not within rows). 

From the right plot in the top row, one can see that at least roughly 600 of the results 
from the GG method appear to be fairly far from optimal, as judged using the value of 
the log likelihood function at these estimates. There are almost about 100 choices of initial 
value however that yield estimates better than the A-B parameter values by this measure, 
and a few that are only slightly worse (about 10“® or 10“® larger). From the left plot in the 
bottom row, all but about 100 of the estimated parameters from the Nelder-Mead method 
have Euclidean distance at least 1 from the A-B parameter values; this may not necessarily 
be indicative of poor estimates, but from the right plot we can see that at least about 
100 estimates are not close, in that the normalized values are at least 0.1 larger than the 
normalized value at the A-B parameters. (So the log likelihood function is at least 12.95 
larger.) In the Appendix, we look at the performance of the SANN search algorithm on the 


A-B data. (See Figure 14 ) The results are neither clearly better nor worse than those for 


^^Again, its location along the horizontal axis is of course not meaningful. 

^®The value .063 equals 10“^'^, which is the left endpoint of the rightmost histogram cell in the plot on 


the right in Figure the value 81.5 is just less than the product 1295 x .063. 
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Atvarez-Brehm Data: Histogram of Euclidean Distances Between 
Estimated and A-B Parameter Values, Using CG (Log Scale) 
(Using 1000 Choices of Initial Values) 



Alvarez-Brehm Data: Histogram of Differences of Normalized 
Vaiues of Log Likelihood Function at Various Estimated Parameter 
Vaiues and Normaiized Vaiues at A-B Parameters 
Using CG (Log Scale) (Using 1000 Choices of initial Values) 
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-0.59383 - F; (Uses Log Scale) 


Alvarez-Brehm Data: Histogram of Euclidean Distances Between 
Estimated and A-B Parameter Values, Using Nelder-Mead (Log Scale) 
(Using 1000 Choices of Initial Values) 



Alvarez-Brehm Data: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at A-B Parameters 
Using Nelder-Mead (Log Scale) (Using 1000 Choices of Initial Values) 



0.01 0.1 1 10 100 1000 

Euclidean Distance (Log Scale) 


0.001 0.01 0.1 1 10 

Normalized Value at A-B Parameters Minus 
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-0.59383 - F; (Uses Log Scale) 


Figure 6: Performance of the CG and Nelder-Mead Search Algorithms on the Alvarez-Brehm 
Data (Using the Heteroskedastic Probit Model), for 1000 Random Choices of Initial Values 

the other search techniques. 
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3 Observations on the Shape of the Log Likelihood 


Function for the Heteroskedastic Probit 

Next, we explore why optimization algorithms might perform poorly when applied to the 
heteroskedastic probit likelihood. We start our exploration with a prohle plot of the log 
likelihood function for the heteroskedastic probit using the the Alvarez-Brehm data. We 
hrst £x a parameter estimate arising from the BFGS algorithm for a particular initial value, 
which we have chosen to have a resulting parameter estimate at which the value of the log 
likelihood is not close to its value at the Alvarez-Brehm estimates. We display a graph 
that demonstrates how the log likelihood varies as the hrst two components 71 and 72 of 7 
varyp^ The value of (3 and the other components of 7 are hxed at the estimated values. 
Figure [^presents this plot, from four different vantage points. In order to make the graph 
readable, we restrict the range of output values plotted to be those values at least — 10 , 000 . 
At the values of (71,72) for which no output values are plotted, the log likelihood function 
is actually much less than — 10 , 000 . 

We see a distinct shape in this graph: a plateau, which ranges roughly over positive 
values of (71,72)- This gives a sense of the plateau but investigation of the graph restricted 
to a smaller range of output values shows that it has a more complex structure. For example, 
there is a ridge for 72 hxed at a certain value a little less than zero, which declines in 71 . 
If this plateau were in some sense hat enough, an optimization algorithm that is essentially 
local in nature may not be able hnd the actual maximum of the function if it starts at an 
initial value somewhere on the plateau. Additionally, it may be led to this plateau from 
other regions and then subsequently remain there. 

^®Throughout Section]^ for ease of notation, we often use (3 and 7 where /3 and 7 might perhaps be more 
appropriate. 
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Figure 7: 

Graphs of (Yi,Y2)-Slice of Log Likelihood Function from Different Perspectives, 
for a Restricted Range of Output Values, for the Alvarez-Brehm Data Set at Estimated Parameters 






Notes: Only output values greater than or equal to -lOOQ^O are graphed. The value of the log likelihood at the A-B Parameters Is: -769.01. The estimated parameters are: 

p=(-65.34,-55.53,-40.91 ,-7.87,2,41 ,-34.08,78.8,62.68). 

V=(50.44,38,75,108,81,77.17,23.58,-22,99), 

The Initial values are: 

pini,={-0.06,-0.96,-0.21,0.99,0.8,-0.27,0.17,0.83). 

Yinit=(-0.82,-0.34,0.05,-0.26,0.73,0,93), 


3.1 The Heteroskedastic Probit Log Likelihood Function 


Next we discuss why we might expect a plateau region of the sort seen in Figure Consider 
the form of the log likelihood function for the heteroskedastic probit, given in Equation (|^. 
In this section, we focus on the case in which z is discrete and, for all i with 1 < i < n, all 


of the components of Zj are nonnegative. For convenience, we denote this by the shorthand 


Z > 0, 
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where we are thinking of Z as above, i.e. as the n x ^2 matrix with row given by z'. In 
the Alvarez-Brehm data used above, this holds, and it is also the case in the simulated data 
in Section We will return to consideration of more general z later. We will also assume 
in this section that there are some values i for which Zj = 0 . 

Then, except in the case that Zj = 0, for values of 7 with components that are positive 
and sufficiently large, we can ensure that z'7 is large and positive. Of course for such i, we 

x^. ^ 

also have that exp(z'7) is large, whence the term -)——- will be close to 0 . Thus, under 

exp(Zi7) 

these conditions we have 


' (/3,7 I y, ~ (2/iln<h(x'/3) + (l-j/i)ln(l-<I)(x'/3))) 

Zi=0 

+ (i/Ond>(0) + (l-|/,)ln(l-<I)(0))) 

l<i<n: Zij^O 

- ^ + (1 - Vi) In (1 - <!> (x'/3))) 


Zi=0 


-(In2) X 


{zi: Zi^O, 1 <i <n} 


( 4 ) 


From the form of the expression in (|^, we can see why there is a plateau in the graphs 
in Figure Consider a choice 7^^ = (711,712, 71^2) with large positive components. 

Throughout the remainder of the paper, we at times abuse language and refer to such 
vectors 7 informally as simply “large and positive.” Then, for 7 = (711,712, • • • ,71^2) near 
7i, the denominator will still be large and positive, so that x'/ 3 / exp(z'7) will be small 
and relatively unaffected by small changes in (3 and 7. 

To see this in more detail, assume that there is a choice of parameters 7^^) and some 
large value of L > 0 so that 7^ satisfies 


min 7y > L. ( 5 ) 


Recall that we have assumed that the z’s are discrete and have nonnegative components. 
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Now assume for convenience that in fact the smallest positive value of the components of 
the z’s is 1 . 

Consider a new choice of parameters (/ 32 , 12 ) with 73 = ( 721 , 722 , • • •, 72 fc 2 ) satisfying, say, 

\l 2 j - lij\ <1 for 1 < j < ^ 2 - ( 6 ) 


Then (|^ and ([^ together imply that 


min 72 ,' > L — 1 . 


Also assume that if > 0 satishes 


x'/3]^| < H and |x'/ 32 | < H for 1 < i < n. 


Now we consider terms of the log likelihood function for which z* is not zero. Fix any 7 
with 1 < ii < n satisfying z,^ 7 ^ 0 . As we have assumed that z is discrete with a minimum 
nonzero value of 1 , it follows that z '^72 > L — 1 , and thus exp(z'^ 72 ) > exp(L — 1 ) holds. 
Therefore one has 


exp(z'^7^.) 


< He^-^ 


for j 


1 and j = 2 . 


So certainly for L large relative to H, 


exp(z'^7i) 


and 


X7/32 

exp(z'^72) 


will both be small in magnitude, and thus both $ and (1 — $) evaluated at these expressions 
will be very close to 1/2 (as *F(0) = 1 / 2 ). So the values of the summand of the log 
likelihood at the two parameter choices (/ 3 ^, 7 i) and (/ 32 , 72 ) will clearly be very close to 
each other; that is, the term of the log likelihood function will not change substantially 
for small perturbations of 7 with large, positive components. Of course, for i such that 
Zj = 0, the term of the log likelihood could change in a substantial way if 7 is on the 
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plateau. But in certain situations, these terms may be a small portion of the total number 
of terms, and thus may not have a strong effect on the value of the sum as a whole. 

Now consider an optimization algorithm used to maximize the log likelihood. Suppose 
the algorithm reaches a point (/3,7) on the plateau, i.e. with 7 having large and positive 
components. As seen above, any small changes in 7 , regardless of the change in f3, will not 
substantially affect the terms indexed by i for which z* 7 ^ 0. We propose that the effect 
of changes in the remaining terms might not be enough to guide the algorithm to the best 
value; any improvements in the terms indexed by those i for which Zj = 0 that might be 
caused by veering off of the plateau could perhaps be outweighed by increased penalties in 
the far greater number of terms indexed by i such that z* 7 ^ 0 . 

Before turning to another topic, we make another observation. As discussed above, if 7 


has large and positive components, then for i with Zj 7 ^ 0, the expression 


x'/3 


will be 


close to 0. Thus $ 


x'/3 


exp(z' 7 ) 


and 1 — <h 


x'/3 


exp(z'7) 


exp(z'7) 
will be close to 1/2. Hence the 


terms in the log likelihood indexed by i for which Zj is nonzero will be close to — In 2. Thus, 
we expect to see values of the log likelihood function close to —nln 2 at the estimates on 
the plateau, especially if there are very few terms with Zj = 0. Thus, in some sense, we can 
also think of —n In 2 as a sort of benchmark for a value that the optimization algorithm 
should be able to easily achieve. Observe that — In 2 —0.693. One can compare the values 

obtained in the simulation to this value. For example, consider the right plot in Figure 
The normalized value of the log likelihood function at the A-B parameters is —.59383, as 
can be read off from the horizontal axis of the plot. There are more than 600 initial values 
from which the BFGS search method results in estimates at which the normalized value 
of the log likelihood is almost .1 less than the value at the A-B parameters, which gives a 
normalized value on the order of — In 2 . As we expect for the reasons outlined here, it was 
in fact common in our various computations for many of these normalized values to be fairly 
close to — In 2 at plateau solutions. 

Considering this benchmark gives another way to think intuitively about the benefit to an 
optimizer of choosing a large positive value for 7 . The function ln(<F(a:)) decreases rapidly 
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for X negative and decreasing. (For example, In ($(—10)) ~ —53.) So, in some sense, the 
penalty for missing by a lot for one observation can be very large, whereas the gain from 
improving the other observations may not be that large. So it may be benehcial to simply 
make every term £t at least somewhat decently, and a large positive value for 7 allows us to 
at least limit these penalties to around —0.693 each. This is obviously not a rigorous line of 
reasoning, but it gives a heuristic for why search techniques might often end with a plateau 
solution. 


3.1.1 Graphical Interpretation: The Basic Probit Model 

Now we give a graphical interpretation that sheds light on the underlying mechanics of 
how optimization algorithms perform, when used to maximize the log likelihood function 
associated with the heteroskedastic probit model. To develop the interpretation, we hrst 
consider a probit model with two-dimensional parameter /3, so that Pr(y = 1 | x) = x'/3. 
For a hxed choice (3 G Figuredepicts several observations xi,X 2 ,... ,X 5 G and the 
hyperplane x'/3 = 0 , which of course in R^ is simply a line. Around each observation x*, we 
draw a dashed line segment perpendicular to the line x'/3 = 0, extending in either direction 
from the point x*; the endpoints are those points Vji and Vj 2 on this perpendicular line that 
satisfy 

Vii/3 = x'/3 - 1 and = x'/3 + 1 . 


This dashed line represents adding an error term e* to the linear combination x'/3. The 
values of Vf3 for vectors v correspond to values of the latent variable y* = x'/3 -|- e* for 

kil < 1. 


The region {x ; x'/3 > 0} is, in this example, the upper right half plane in Figure 
because the value of /3 we use is (1.5,1). This region roughly corresponds to the region 
with those x* for which the associated yi equals 1 , while the lower left half plane is the 
region {x ; x'/3 < 0 }, which corresponds roughly to the region with y^ = 0 ; note that we 
say “roughly” because one could have what we have termed crossover, Crossover occurs 


ignore points satisfying x'/3 = 0 here as its consideration would only complicate the discussion, 
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Xi 

Figure 8: A Graphical Representation of the Probit Model 

if either (i) x'/3 > 0 holds yet = 0, or (ii) x'/3 < 0 holds yet yi = 1. For an example, 
consider X 2 in Figure]^ One in fact has X 2/3 = —0.5. If we assume that 62 = 0.75, then the 
latent variable = X 2/3 + 62 = 0.25, so 7/2 = 1, and thus for this value of 62 , observation 2 
exhibits crossover. 

Now consider, in terms of Figure what choices might be made by an optimization 
algorithm used to maximize the value of the log likelihood for the probit model. Despite 
which we give only for the sake of intuitive understanding, and moreover would not add any material value. 
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the fact that generally an algorithm is used to maximize the log likelihood, for exposition 
purposes, we will generally speak of an “optimizer” as if an individual were attempting to 
maximize the log likelihood. So consider the task facing this optimizer. One key goal, given 
data {(r/j,Xj)}i<j<„, is to choose /3 so that most Xj’s are on the correct side of x'/3 = 0, 
so to speak; i.e., one would choose x* to be in the half-plane x'/3 > 0 if r/j = 1 and in the 
half-plane x'/3 < 0 if r/j = 0. In other words, an optimizer would aim to minimize crossover. 
Of course, an optimizer need not do so exact lyp] We can think of each summand of the 
log likelihood function as assigning a penalty if there is crossover; of course, there is also 
a summand that is added even if there is no crossover for the observation, but all of these 
summands are negative and we think of most of them as having small magnitude, so that 
intuitively, an optimizer might be mainly concerned with the summands for which there is 
in fact crossover. These penalties for crossover are determined as follows. First, consider 
what we have just noted: this penalty is always negative. For the points Xj that are far from 
the line x'/3 = 0, the penalty for crossover in the log likelihood function is much larger (in 
magnitude) than the penalty for points Xj near the line. For example, the term of the log 
likelihood function for i with j/j = 1 is ln<F(x'/3), which is of course much more negative, the 
more negative is the value of x'/3. So, for example, if j/i = j /2 = 1 were to hold, then given 
their positions as shown in Figure it is more important to put xi in the correct half plane 
than X 2 , as the penalty ln<h(X|/3) from not doing so is much larger (in magnitude) than the 
corresponding penalty ln<h(x 2 / 3 ). 

3.1.2 Graphical Interpretation: The Heteroskedastic Probit Model 

Now we consider the heteroskedastic probit model, and, for simplicity, we examine the case 
in which dim(x) = 2 and dim(z) = 2, i.e. for / 3,7 G The picture here includes a 
few modihcations to Figure Again, the objective of an optimizer involves choosing (3 to 
minimize crossover but now with an additional tool, namely the optimizer can choose 7 so 
as to affect the penalties associated with crossover. Figure presents the graphs. The top 

^^And indeed it is probably extremely unlikely with real-world data to be able to do so, and if so, there 
would be the problem of complete separation. 
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Xl 



Figure 9: A Graphical Representation of the Heteroskedastic Probit Model 

graph is similar to the plot in Figure]^ Again, the line x'/3 = 0 separates the observations 
Xl,... ,X 5 into two half-planes. Also, we again draw dashed line segments around each x*. 




in the direction perpendicular to the line x'/3 = 0. But now these line segments extend for 
lengths that vary for each observation; this length, for the point x*, varies with exp(z' 7 ); 
much as in the probit graph, the endpoints of the line segment are the points Vji and Vj 2 
satisfying 

Vii/3 = x'/3 - exp(z' 7 ) and v' 2/3 = x'/3 + exp(z' 7 ). (7) 


Recall from ([^ that exp(z' 7 ) is the standard deviationj^ of the error term that forms 
part of the definition of the latent variable y*. For those observations i with exp(z' 7 ) large, 
speaking very loosely, the model assigns a larger probability of crossover (speaking very 
loosely again), as the error term is more likely to be larger, as represented in the picture by 
a longer dashed line segment. 

Consider our earlier framework of thinking of each summand of the log likelihood function 
as giving a penalty for crossover (and contributing a small negative number to the sum if 
not). Recall from (|^ that this penalty is always negative. Suppose that there is crossover 
for observations 1 and 2; from the graph, we can see that these two points are in the region 
{x : x'/3 < 0}, so in other words, given that we are assuming there is crossover, we have 
2/1 = 1/2 = From (|^, the penalty for crossover, as i/i = 1/2 = 1, is 


In 



exp(z' 7 )yy 


for i = 1,2, 


which is, speaking very loosely, smaller (in magnitude), the larger is exp(z' 7 ) (as x'/3 < 0 


for i = 1,2.). Of course, the size of the penalty also depends on the magnitude of x'/3, and 

x'/3 


more precisely on the ratio 


It is not immediately clear from the graph, but for our 


exp(z' 7 )' 

purpose here of providing exposition, one can perhaps see that it makes some intuitive sense 


that this penalty would be smaller (in magnitude) for observation 1 than for observation 2 
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^^The decision to use the standard deviation is somewhat arbitrary; we are merely showing one pictorial 
representation of the main concepts. 

^^As with the probit graph, one needs to know the value of /? to determine which side of the line is the 
region {x : x'f3 < 0} and which side the region {x : x.'f3 > 0}. Here we have in fact chosen (3 = (—1.2,1). 
Obviously had we chosen (1.2, —1), or some positive scalar multiple of it, the regions would be reversed. 

^^To see this precisely, first recall that we have x(/3 < 0 and X2,3 < 0. The fact that the dashed line for 
observation 1 crosses the line x'j3 = 0 shows, from Q, that exp(z( 7 ) > |x(/3|, whence x(/3/ exp(z( 7 ) > — 1 
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Thus a maximizer gets penalized much less for choosing /3 so that xi is on the wrong side 
of the dividing line than for choosing (3 so that X 2 is on the wrong side. 


Now consider the bottom graph in Figure]^ which depicts the observations Zj (1 < i < 5). 
Also we see the hyperplane z '7 = 0 (here a line, as we have assumed above that 7 is two- 
dimensional). Here the upper right half represents the region {z ; z '7 > 0}, and the lower 


left half plane the region {z : z '7 < 0}, ° Note that there is no special signihcance of the line 


z '7 = 0 analogous to the signihcance of the line x'/3 = 0 , but we include it as an indicator 
of the value of z' 7 . Observe that those observations i with z '7 positive and large will have 
exp (z' 7 ) large and thus have a longer associated dashed line segment in the top graph of 
Figure 

An optimizer, then, attempts to simultaneously choose two lines (or, more generally, two 
hyperplanes): one to split the Xj’s into two regions to try to give observations a positive 
value for x'/3 ii yi = 1 and a negative value if = 0, and the second to arrange the Zj’s so 
that (loosely speaking) the smaller is z' 7 , the less likely observation i is to exhibit crossover. 


Now observe what happens for an optimizer who chooses a plateau solution. In this 

case, we return to the earlier assumption that Z > 0 (which is of course not the case in the 

bottom graph in Figure]^, and that there are some observations i with Zj = 0. For this 

holds. On the other hand, the dashed line for observation 2 does not cross the line x'/3 = 0. Thus 
exp(z 27 ) < |x 2/3|, whence X 2 / 3 /exp(z 27 ) < —1 holds. Therefore we have X 2 / 3 /exp(z 27 ) < x(/3/exp(z'j 7 ) 
and thus In ($ (x^/S/ exp(z 27 ))) < In ($ (x'j/3/ exp(z( 7 ))). So the penalty for observation 2 is more negative 
than that for observation 1, i.e., the penalty for observation 1 is smaller (in magnitude) than the penalty for 
observation 2. 

We present an example showing why the perpendicular distance from the end of the dashed line to the 
line x'/3 = 0 does not completely determine the penalty, although it may seem at first that this might be 
the case. Consider two observations indexed by a and b with xJj/3 = 5 and exp(zjj 7 ) = 1, and x(,3 = 25 
and exp(z( 7 ) = 15. For each of these two observations, as x'/3 > 0 holds for i = a,b, the endpoint of the 
dashed line that is closest to the line x'/3 = 0 is Vi^ in the notation of Q, and thus the distance from the 
end of each dashed lines to the line x'/3 = 0 is determined by x'/3 — exp(z' 7 ). Set Ui = x'/3 — exp(z' 7 ). This 
distance is a strictly increasing function in Ui with value d{ui); i.e. it varies in an increasing fashion (albeit 
nonlinearly) with Ui. Now for observation a, this distance is d(A) = d{5 — 1), whereas it is d{l0) = d{25 — 15) 
for observation b. The latter is larger, so one might be tempted given the way we have represented the model 
graphically to believe that the penalty would be smaller for observation b, as the dashed line is farther from 
the line x'f3 = 0. But in fact, the ratio x'^/ exp(z' 7 ) is larger for j = 5 than it is for i = 1, as 15/25 > 1/5, so 
the penalty for observation b is in fact larger. We could have drawn the dashed lines with length equal to the 
above ratio, but despite the flaws, we decided to draw the dashed lines as here, as it gives a representation 
of the range of values for the latent variable y*, and because other alternatives seemed less natural and less 
apt to provide the reader with graphical intuition. 

^^Here we have chosen 7 = (1, 5). 
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plateau solution, the optimizer chooses 7 with large and positive components]^ The two 
planes for an illustrative case are presented in Figure First look at the bottom panel. 
Observations with Zj = 0, which we label zi, Z 2 and Z 3 for expositional convenience, of course 
lie directly on the line, while all of the other observations (represented here by unlabeled 
points) he in the region z '7 > 0 and away from the line. In the top graph, we see that only 
the points xi,X 2 and X 3 have small associated dashed line segments, while all others have 
large associated segments. (Note that some of these segments extend outside of the region 
of the graph.) So, by virtue of choosing 7 with large, positive components, the optimizer 
has ensured that only the observations i = 1,2,3 matter very much for choosing the line 
x'/3 = 0, i.e. for choosing (3. 


4 Potential Improvement and Other Suggestions 


We now propose a straightforward method to mitigate the problems we have identified in 
the heteroskedastic probit model. We suggest ensuring that the model simply satishes the 
property that at least one of the components of the vectors in Zj take both signs. We would 
hope for this to avoid the plateau problem, or at least the exact form of it described above. 
Observe, however, that we do not claim that if one merely does so then maximizing the log 
likelihood function for the heteroskedastic probit model is robust; we are only stating that 
this is likely a helpful modihcation. Here we assume that there is no constant term in z 
If there were a constant term in z*, say zn, one could simply choose the parameter 71 to be 
positive and very large relative to the other components of Zj and thus ensure that exp(z' 7 ) 
is always large, thus undoubtedly leading to a similar plateau problem. 

We give a demonstration of the potential efficacy of our solution, by comparing the 
performance of an optimization algorithm on (i) a data set with Z > 0 and (ii) a translation 
of this first data set that ensures that the observations Zj take both signs. The first data set 


w 


^®The choices we have made for these graphs, for ease of illustration, are /3 = (—1.5, —5) and 7 = ( 8 ,2). 
^^There may or may not be a constant term in x here. In general, there should of course not be a constant 
term in both x and z, as the model would not be identified. 
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is created as described above, and, for example, is forced to satisfy the requirement that the 
percentage of observations with crossover is between 20% and 30%. Denote the data set by 
V. Let n be the number of observations in the data set. V consists of the parameters /3g 
and 7 q and the observations y, X and Z, so we write 

Also dehne ki = dim(/3Q) and /c 2 = dim( 7 Q). Observe that we have dim(y) = n, while 
dim(X) = n X ki and dim(Z) = n x k 2 . Write 

7 o = (701,702, • • • ,70^2)- 

For convenience, write 

1 =( 1 , 1 ,..., 1 ). 

"-V-' 

length k2 

Now we dehne the transformed data set V. Let 

A = l^exp ^ ^ j j /3o, 

To = 7 o, 

y = y, 

Xj = Xj for 1 < i < n, 

11 

Zi = Zi - -1 = Zi - - ( 1 , 1 , - •, 1) I <i <n, 

length k2 

and 

^ = {3o,7o,y,x,z| . 

We note for emphasis that we have dim {j3q^ = and dim ( 7 g) = A; 2 , and also dim (y) = n, 
while dim = n x ki and dim (z\ = n x k 2 - 
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Now, consider, for each i, the expressions 


x'/3 


and 


x'/3 


These appear as 


exp(z' 7 ) exp(z' 7 )' 

terms in each summand of the respective log likelihood functions for the heteroskedastic 

probit model, for the original and transformed data sets. We claim that for all i, the value 

of this term for the original data set equals its value for the transformed data set at each 

data set’s respective model parameters: for we have 


exp(i'^) 


(•■■(-s?)) 

(•■■(-sd) 

exp (z'7o) ■ 


exp (^(zi - |l)'7o) 

_^<^0_ 

exp (z'7o)exp (-|l'7o) 


Together with the observation that y = y, we can see that the summand of the log 
likelihood function for the original data set at its model parameters (/3o)7o) is equal to 
the summand of the log likelihood function for the transformed data set at its model 
parameters Jq). In other words, writing y = {yi, 1 / 2 , • • •, Vn) and y = {yi, ^ 2 , • • •, Vn), we 
have: 




x'/3 


exp(z' 7 ) 


+ (1 - l/i) In 1 - $ 


X-/3 


y* ln<h 


exp(z'7) 


exp(z'7o) 


+ (1 - jji) In - $ 
for 1 < i < n. 




exp(z'7o) 


It of course follows immediately that the respective log likelihood functions at the respective 
model parameters are equal, i.e. that 



Now we turn to some details. As described above, we create a pseudorandom set V. We 
then generate 1000 random choices of initial values in the box [—5, 5]®, and measure the search 
performance of the BFGS algorithm in estimating the model parameters (/3o)7o)- Then we 
transform the data set V as described above to get a new data set V. We then transform 
these 1000 new random choices of initial values (in the same way we transformed (3 in the 
equations above, keeping 7 unchanged), and subsequently measure the search performance 
of the BFGS algorithm in approaching its model parameters ^/3q,^q 
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We present the results in Figure 11 First consider the left graphs. The top left plot 
displays a histogram of the Euclidean distances between the estimated and model parameter 
values for the 1000 runs for the original data set V; the bottom left plot displays the analogous 


histogram for the transformed data set V. Glearly the performance in the latter case is far 
superior. Now consider the right column of graphs. As with earlier plots, we also display 
histograms of the (signed) differences between the normalized values of the log likelihood at 
the model and estimated parameters. (Recall that the former might generally be expected 
to be greater, i.e. a smaller negative number.) The top graph corresponds to the original 
simulated data set; the bottom plot to the transformed data set. Horizontal and vertical 
axes in both graphs have the same scale. Again, in both graphs, we have also plotted a bar 
that represents the number of choices of initial values for which the log likelihood value at 
the resulting estimates is actually larger than the log likelihood at the model parameters]^ 
There are many more of these for the transformed data set, and in fact for the transformed 
data set, there are only three choices of initial values (out of 1000 ) for which the log likelihood 
at the estimated parameters is worse than (i.e. less than) the log likelihood at the model 
parameters. Glearly, by this method of comparison as well, the BFGS search algorithm 
performs much better on the transformed data set. 

Naturally, we have only presented evidence for one simulated data set and one search 


^®We also ran a simulation on the same data set but where we chose different initial values for each of the 
original and transformed data set. The results are very similar. 

^^Observe that the plots have the same scales on the horizontal and vertical axes. 

^°As above, the horizontal position of this bar is of course not meaningful. 
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Original Simulated Data Set: Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using BFGS (Log Scale) 
(Using 1000 Choices of Initial Values) 



0.1 1 10 100 1000 10000 

Euclidean Distance (Log Scale) 


Transformed Data Set: Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using BFGS (Log Scale) 
(Using 1000 Choices of Initial Values) 



0.1 1 10 100 1000 10000 


Euclidean Distance (Log Scale) 


Original Simulated Data Set: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using BFGS (Log Scale) (Using 1000 Choices of Initial Values) 



at Model Parameters 

0.01 0.05 0.1 0.5 1 

Normalized Value at Model Parameters Minus 
Normalized Value F at Estimate, i.e.: 

-0.39348 - F; (Uses Log Scale) 


Transformed Data Set: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 
Using BFGS (Log Scale) (Using 1000 Choices of Initial Values) 



at Model Parameters 

0.01 0.05 0.1 0.5 1 

Normalized Value at Model Parameters Minus 
Normalized Value F at Estimate, i.e.: 

-0.39348 - F; (Uses Log Scale) 


Figure 11: Comparison of Performance of the BFGS Search Algorithm for a Simulated Data 
Set and a Transformation of that Data Set (Using the Heteroskedastic Probit Model), for 
1000 Random Choices of Initial Values 

algorithm]^ But we have given reasons why this transformation may mitigate the plateau 
problem]^ 

also performed similar simulations for the CG, Nelder-Mead and SANN algorithms. The CG algo¬ 
rithm performs much better for the transformed data set than for the original data set, and the Nelder-Mead 
method performs better but not strongly so; the SANN algorithm on the other hand performs fairly well for 
both data sets. 

32lt may be of interest to discuss, in terms of a search algorithm, how the transformation affects the 
plateau problem. Note that parameter values on the plateau for the original set, i.e. with large values of 7 , 
correspond, under the transformation, to values of /3 close to 0; this can be seen from the definition of the 
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4.1 Additional Suggestions 


In addition to the above modification, we make a few other suggestions that may be helpful 
for researchers to bear in mind when using a search algorithm to estimate a heteroskedastic 
probit model. 

One thing we suggest is that a researcher should try multiple different starting values 
for search algorithms. One could then check if one gets a stable set of resulting parameter 
estimates; if not, caution is called for. This is presumably a good approach in general, 
but it appears to be especially warranted for the heteroskedastic probit model. As well, 
if a researcher uses a search algorithm, and finds an estimate 7 that has large positive 
components, there should be concern that it is a plateau solution; in line with the previous 
suggestion, the researcher should try many other initial values. If that does not improve the 
situation, results should be interpreted with caution. The simulated annealing algorithm 
(SANN in this implementation) is, in some respects, similar in spirit to the notion of starting 
search algorithms at multiple initial values, in that it incorporates random jumps in the 
procedure. In some of the simulations SANN performed much better than some of the other 
algorithms; however, it did not perform as well as other methods on the Alvarez-Brehm data. 
So a simulated annealing algorithm might be a useful search method for the heteroskedastic 
probit. We emphasize that we do not view a simulated annealing algorithm as a panacea 
by any means, and we are not implying that simply using a simulated annealing method 
suffices to address the problems we have identified in the heteroskedastic probit model. 

Finally, we make an observation regarding the special case in which x and z are inde¬ 
pendent. From Q, we can see that intuitively, for positive and large choices of 7 , it seems 
that the choice of /3 would be determined by the optimal choice of (3 for the subset of i for 
which Zj = 0. If X and z are independent, then the observations Xj for i in this subset are 
a random sample, and thus we would expect the estimate of /3 q to be good in some sense 
if the estimate 7 is positive and large. Thus this presents a potential (informal) method of 

transformation. Thus, small jumps in a local search on the transformed set would lead a search algorithm 
away from 0, which would correspond to escaping the plateau in the parameter space for the original data 
set. 
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estimation for the heteroskedastic probit in this special case: first, find an estimate /3 and 
if the associated estimate 7 is positive and large, then, second, £x this value of (3 and use a 
search algorithm to estimate only the 79 parameter]^ 


5 Final Remarks 

Here we have explored how commonly used optimization algorithms perform when applied 
to the heteroskedastic probit likelihood. Using both simulated data sets and in a re-analysis 
of the seminal work by Alvarez and Brehm, we find that some optimization algorithms can 
often converge at incorrect solutions. We compare these same algorithms when applied 
to standard probit log likelihood functions and hnd no such problems. We also sketch a 
heuristic argument for why these difficulties may be inherent in the heteroskedastic probit 
likelihood. Our work constitutes a first step, by calling attention to some potential problems 
in using the model. But more remains to be done to understand the particular functional 
form. However, analysts that estimate heteroskedastic probit models should at the very least 
ensure that a variety of starting values converge to the same set of estimates. They may 
also want to perform a transformation of the data and parameters in the fashion we have 
described. It would also be advisable to try more robust optimization algorithms such as 


simulated annealing or a genetic optimization algorithm (Sekhon and Mebane 1998) 


alternative method might involve starting with a positive and large initial value for 7 . Also, we have 
observed the estimate of (Bq to be close to the model or A-B parameters in multiple simulations. But we do 
not investigate making either of the proposed methods more formal, as this is not the focus of this paper. 
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A Appendix: Graphs for the SANN Search Algorithm 


Figure 12: Performance of the SANN Search Algorithm on a Simulated Data Set for the 
Heteroskedastic Probit Model, for 1000 Random Choices of Initial Values 


Simulated Data: Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using SANN (Log Scale) 
(Using 1000 Choices of Initial Values) 
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Simulated Data: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at Model Parameters 



at Model Parameters 

0.01 0.1 1 

Normalized Value at Model Parameters Minus 
Normalized Value F at Estimate, i.e.: 
-0.45946 - F; {Uses Log Scale) 
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Figure 13: Performance of the SANN Search Algorithm on a Simulated Data Set for the 
Probit Model, for 1000 Random Choices of Initial Values 


Simulated Data: Histogram of Euclidean Distances Between 
Estimated and Model Parameter Values, Using SANN (Log Scale) 
(Using 1000 Choices of Initial Values) 



Simulated Data; Histogram ot Ditterences ot Normalized Values 
of Log Likelihood Function at Various Estimated Parameter 


Values and Normalized Values at Model Parameters 
Using SANN (Log Scale) 




Values Better than 
at Model Parameters 


Euclidean Distance (Log Scale) 
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Figure 14: Performance of the SANN Search Algorithm on the Alvarez-Brehm Data, for 
1000 Random Choices of Initial Values 


Alvarez-Brehm Data: Histogram of Euclidean Distances Between 
Estimated and A-B Parameter Values, Using SANN (Log Scale) 
(Using 1000 Choices of Initial Values) 


Aivarez-Brehm Data: Histogram of Differences of Normalized 
Values of Log Likelihood Function at Various Estimated Parameter 
Values and Normalized Values at A-B Parameters 
Using SANN (Log Scale) (Using 1000 Choices of Initial Values) 
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Normalized Value at A-B Parameters Minus 
Normalized Value F at Estimate, i.e.: 
-0.59383 - F; (Uses Log Scale) 
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